{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question-Answering with a Multi-Layer Bidirectional RNN\n",
    "\n",
    "### Learning Objectives: \n",
    "\n",
    "- Build a first simple RNN model in Keras/TensorFlow for Question-Answering \n",
    "- Build and improve a Multi-Layer Bidirectional RNN in Keras/TensorFlow\n",
    "- Evaluate the results on a test set\n",
    "- Try out our model several questions in the test/validation sets\n",
    "\n",
    "\n",
    "This notebook demonstrates how to build and train a **Multi-Layer Bidirectional Recurrent Neural Network (RNN)** to perform Question-Answering tasks over a Corpus of Documents.\n",
    "\n",
    "**Question-Answering** consists in extracting precise and concise information (in the form of a short phrase or span of words) from a group of documents in order to answer a question. It involves a combination of **Document Retrieval** and **Natural Language Processing** techniques:\n",
    "- **Step1:** Given the question, we retrieve the *k* most relevant documents from our Corpus. \n",
    "- **Step2:** We run a Text Comprehension model (our RNN) over the retrieved documents to extract the most satisfying answer in the form of a concise span of words. *This notebook will focus on this step.*  \n",
    "\n",
    "The key idea behind Question-Answering is to refuse to stop after Step1: **instead of letting the user go through all *k* documents and read through long paragraphs of text to find their answer, the model keeps going and finds it for them.**\n",
    "\n",
    "<img src=\"../assets/qa/QA.png\" alt=\"image\" style=\"width: 500;\"/>\n",
    "\n",
    "### Datasets & Resources\n",
    "\n",
    "- **Training/Validation Set:** <a href=\"https://rajpurkar.github.io/SQuAD-explorer/\" target=\"_blank\">Ground-Truth Question-Answer triplets</a> extracted from the Stanford Question Answering Dataset (**SQuAD**) consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.\n",
    "- **Embeddings:** We'll use the <a href=\"https://tfhub.dev/google/nnlm-en-dim128/2\" target=\"_blank\">NNLM text embedding module</a> trained from the English Google News 200B corpus.\n",
    "- **Research:**   \n",
    "    * **<a href=\"https://arxiv.org/pdf/1704.00051.pdf\" target=\"_blank\">Reading Wikipedia to Answer Open-Domain Questions</a>** by Chen et Al (2017),\n",
    "    * **<a href=\"https://arxiv.org/pdf/1606.05250.pdf\" target=\"_blank\">SQuAD original Research paper</a>**  by Rajpurkar et al (2016).\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Before we start: Import Tensorflow & Other Tools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basics for Data Manipulation\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Tensorflow and Keras tools\n",
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub\n",
    "\n",
    "from tensorflow.keras.models import Sequential, Model\n",
    "from tensorflow.keras.layers import (\n",
    "    Layer,\n",
    "    Input,\n",
    "    Dense,\n",
    "    Concatenate,\n",
    "    Masking,\n",
    "    Embedding,\n",
    "    Dropout,\n",
    "    Softmax,\n",
    "    Dot,\n",
    "    Lambda,\n",
    "    SimpleRNN, \n",
    "    GRU, \n",
    "    LSTM, \n",
    "    Bidirectional\n",
    ")\n",
    "\n",
    "from tensorflow.keras import regularizers\n",
    "from tensorflow.keras.preprocessing.text import Tokenizer\n",
    "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
    "from tensorflow.keras.preprocessing.text import text_to_word_sequence\n",
    "\n",
    "from tensorflow.keras.callbacks import ModelCheckpoint\n",
    "\n",
    "from tensorflow.keras.utils import plot_model\n",
    "from tensorflow.keras.models import Model\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "import string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First Overview of the Data\n",
    "*You'll find the test data as well as an extract of the training data (with 15,000 datapoints) in this github - in the 'assets' folder.*  \n",
    "Our **Training QA dataset** consists in ground-truth question-context-answer triples extracted from the SQuAD Dataset. Our model will take as input both the question and the context, and output the predicted start_word and end_word:\n",
    "* **question**: the question\n",
    "* **context**: the paragraph to look into for the answer\n",
    "* **text**: the answer as a span of text\n",
    "* **c_id**: the context id (some questions are asked on the same context)\n",
    "* **start_word**: index of the first answer word in **context**\n",
    "* **end_word**: index of the last answer word in **context**\n",
    "\n",
    "Our **Test QA dataset** is extracted from the SQuAD Dev Set and consists in question-context pairs as well as up to 4 \"human answers\" to compare our results with - the idea here is to allow for more flexibility when evaluating our model and ignore minor differences in ambiguous cases. For example, \"*the sky is blue*\" and \"*blue*\" may both be considered as correct answers to the question \"*what color is the sky?*\".\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Loading the training data i.e. a dataframe of 15,000 ground-truth question-context-answer triples from the SQuAD Dataset\n",
    "data = pd.read_csv('../assets/qa/squadlite.csv')   \n",
    "data.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Loading the test data i.e. a dataframe of 5,000+ questions/answers to test on model on\n",
    "test = pd.read_csv('../assets/qa/squadtest.csv')   \n",
    "test.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Embedding & Prepping the data\n",
    "**Embedding** consists in mapping words or phrases to vectors of real numbers. Conceptually it involves a mathematical projection from a space with many dimensions per word to a continuous vector space with a much lower dimension. Here, we can use the <a href=\"https://tfhub.dev/google/nnlm-en-dim128/2\" target=\"_blank\">NNLM text embedding module</a> from TF-Hub, a 128-d embedding built from the English Google News 200B corpus.\n",
    "\n",
    "**Padding** is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data. This is particularly useful when we want to use the same model with input data of different lengths: it will be crucial here as questions and paragraphs do not all have the same number of words in them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Splitting the questions/paragraphs into words and embedding them...\n",
    "pars = []\n",
    "ques = []\n",
    "embed = hub.KerasLayer(\"https://tfhub.dev/google/nnlm-en-dim128/2\") #NNLM\n",
    "\n",
    "for text in data.context:\n",
    "    words = np.array(text_to_word_sequence(text))\n",
    "    pars.append(embed(tf.constant(words)))\n",
    "for text in data.question:\n",
    "    words = np.array(text_to_word_sequence(text))\n",
    "    ques.append(embed(tf.constant(words)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now padding...\n",
    "padded_pars = pad_sequences(pars, padding=\"post\",dtype='float32')\n",
    "padded_ques = pad_sequences(ques, padding=\"post\",dtype='float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Key Dimensions\n",
    "batch_size = np.shape(padded_pars)[0] #Batch Size\n",
    "max_paragraph_length = np.shape(padded_pars)[1] #Time Steps\n",
    "max_question_length = np.shape(padded_ques)[1] #Time Steps\n",
    "emb_dim = np.shape(padded_pars)[2] #Embed Dimension\n",
    "\n",
    "print(\"Shape of the Padded Embedded Paragraphs: \", np.shape(padded_pars))\n",
    "print(\"Shape of the Padded Embedded Questions: \", np.shape(padded_ques))\n",
    "print(\"i.e. (Batch Size, Sequence Length, Embed Dimension)\")\n",
    "\n",
    "# Our y data (i.e the positions of the answer's start and end words)\n",
    "y_start_word = np.array(data.start_word)\n",
    "y_end_word = np.array(data.end_word)\n",
    "print(\"Shape of the Y Train set for Start Word: \", np.shape(y_start_word))\n",
    "print(\"Shape of the Y Train set for End Word: \", np.shape(y_end_word))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train & Validation\n",
    "p_train, p_val, q_train, q_val, ys_train, ys_val, ye_train, ye_val = train_test_split(\n",
    "    padded_pars, padded_ques, y_start_word, y_end_word, test_size=0.1, random_state=30\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics and Evaluation\n",
    "Let's define our **evaluation metrics** right away (for more details, see the  <a href=\"https://arxiv.org/pdf/1606.05250.pdf\" target=\"_blank\">SQuAD original Research paper</a>  by Rajpurkar & al.): \n",
    "* **Exact Match**: This metric measures the percentage of predictions that match the ground truth answers exactly.\n",
    "* **F1 Score**: This metric measures the average overlap between the prediction and ground truth answer, by computing their F1 (harmonic mean of precision and recall):  \n",
    "    * **P = Precision** = the number of *true positives* divided by the total of *predicted positives* = tp /(tp+fp). In our case here, this is the number of words appearing in both prediction and ground truth answer, divided by the total number of words in the *prediction*.\n",
    "    * **R = Recall** = the number of *true positives* divided by the total of *actual positives* = tp/(tp+fn). In our case here, this is the number of words appearing in both prediction and ground truth answer, divided by the total number of words in the *ground truth answer*.\n",
    "    * **F1 Score** = 2PR/(P + R)\n",
    "    * While the F1 Score is less strict than EM, it is still a very exigent metric: as an example, for the answer to the question \"Who committed the crime?\", the F1 Score between \"The butler did it\" and \"The butler\" is only 0.67 even though one could argue that both answers are correct.\n",
    "\n",
    "For reference, in <a href=\"https://arxiv.org/pdf/1704.00051.pdf\" target=\"_blank\">Reading Wikipedia to Answer Open-Domain Questions</a> (2017), Chen et Al managed to obtain a **69.5% EM Score** and a **78.8% F1 Score** on the SQuAD Dev Set with their most finely-tuned model (using a dataset of 100,000+ datapoints)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create helper functions to measure these metrics\n",
    "# Both exact_match & f1_score take strings as inputs\n",
    "\n",
    "def exact_match(pred, truth):\n",
    "    truth = str(truth).replace(\"-\", \" \")\n",
    "    truth = \"\".join(l for l in truth if l not in string.punctuation)\n",
    "    return np.sum(str(pred).lower() == str(truth).lower())\n",
    "\n",
    "\n",
    "def f1_score(pred, truth):\n",
    "    p = text_to_word_sequence(str(pred))\n",
    "    t = text_to_word_sequence(str(truth))\n",
    "    tp = [i for i in p if i in t]\n",
    "    if len(tp) == 0:\n",
    "        f1 = 0\n",
    "    else:\n",
    "        precision = len(tp)/len(p)\n",
    "        recall = len(tp)/len(t)   \n",
    "        f1 = 2 * (precision * recall) / (precision + recall)\n",
    "    return f1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's a Recurrent Neural Network (RNN)?\n",
    "A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. They are particularly useful to solve NLP problems: in a sentence or phrase, the sequence of words is just as important - if not more - as the individual words themselves.\n",
    "  \n",
    "There are built-in RNN layers in Keras, ready to use to quickly build recurrent models without having to make difficult configuration choices:\n",
    "* keras.layers.**SimpleRNN**, a basic fully-connected RNN,\n",
    "* keras.layers.**LSTM**, first proposed in Hochreiter & Schmidhuber 1997 to address the *Vanishing Gradient* problem,  \n",
    "* keras.layers.**GRU**, first proposed in Cho et al. 2014 as a simplified version of LSTM,\n",
    "\n",
    "You can check the <a href=\"https://keras.io/api/layers/recurrent_layers/\" target=\"_blank\">RNN API documentation</a> for more information, or go through the <a href=\"https://www.tensorflow.org/guide/keras/rnn\" target=\"_blank\">TensorFlow guide on RNN</a> for simple examples and setups.  \n",
    "Also check out <a href=\"http://proceedings.mlr.press/v37/jozefowicz15.pdf\" target=\"_blank\">**An Empirical Exploration of Recurrent Network Architectures**</a> by Jozefowicz et al. for more details about the differences between GRU and LSTM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " <img src=\"../assets/qa/First.png\" alt=\"image\" style=\"width: 100%;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This first model will be the foundation for all our following models. We have:\n",
    "* **Pre-Embedded Input**: paragraph p and question q both pre-embedded with NNLM - we just did that! Yay!\n",
    "* **GRU Layers**: straightforward GRU layers, one for p and one for q \n",
    "* **Weighthed Average for q'**: to obtain a single vector q' - the A weight vector within the Softmax will be learned\n",
    "* **Bilinear Similarity Layer on p' and q'**: to obtain start and end probabilities (this will give us the answer span) - we can obtain this with using a Dense layer (with a 'linear' activation function) followed by a Dot product \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First Input = Paragraphs / Straightforward GRU Layer\n",
    "paragraphs = Input(shape=(max_paragraph_length, emb_dim), name=\"pars_in\")\n",
    "p = Masking(mask_value=0)(paragraphs)\n",
    "p = GRU(\n",
    "    256,\n",
    "    return_sequences=True,\n",
    "    name=\"pars_out\",\n",
    "    kernel_regularizer=regularizers.l2(0.002),\n",
    "    kernel_initializer=\"glorot_normal\",\n",
    ")(p)\n",
    "# Output is = a 128d vector per word in the paragraph (None, max_paragraph_length, 128).\n",
    "\n",
    "# Second Input = Questions / Straightforward GRU\n",
    "questions = Input(shape=(max_question_length, emb_dim), name=\"ques_in\")\n",
    "q = Masking(mask_value=0)(questions)\n",
    "q = GRU(\n",
    "    256,\n",
    "    return_sequences=True,\n",
    "    name=\"ques_gru\",\n",
    "    kernel_regularizer=regularizers.l2(0.002),\n",
    "    kernel_initializer=\"glorot_normal\",\n",
    ")(q)\n",
    "# Output is = a 256d vector per word in the paragraph (None, max_question_length, 256).\n",
    "\n",
    "# Weighted Average to obtain the single vector q'\n",
    "weights = Dense(1, activation=\"softmax\", name=\"weights\")(q)\n",
    "q = Dot(axes=1, name=\"ques_out\")([weights, q])\n",
    "# Output is = a single 256d vector per question (None, 256, 1).\n",
    "\n",
    "# Outputs for Start & End / Quadratic Layers and Softmax\n",
    "qs = Dense(\n",
    "    256,\n",
    "    activation=\"linear\",\n",
    "    name=\"s1\",\n",
    "    use_bias=False,\n",
    "    kernel_regularizer=regularizers.l2(0.001),\n",
    ")(q)\n",
    "outs = Dot(axes=(2, 2), name=\"s2\")([p, qs])\n",
    "qe = Dense(\n",
    "    256,\n",
    "    activation=\"linear\",\n",
    "    name=\"e1\",\n",
    "    use_bias=False,\n",
    "    kernel_regularizer=regularizers.l2(0.001),\n",
    ")(q)\n",
    "oute = Dot(axes=(2, 2), name=\"e2\")([p, qe])\n",
    "# Output is = a probability vector (None, seq_pars, 1) for each\n",
    "\n",
    "# Model\n",
    "model = Model(inputs=[paragraphs, questions], outputs=[outs, oute])\n",
    "# print(BaseModel.summary())\n",
    "\n",
    "# Model Chart\n",
    "plot_model(model, to_file=\"Baseline.png\", show_shapes=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "  \n",
    "\n",
    "Let's now compile and train our model, using **Sparce Categorical Cross-Entropy Loss** as our loss function (you can read more about it <a href=\"https://cwiki.apache.org/confluence/display/MXNET/Multi-hot+Sparse+Categorical+Cross-entropy#:~:text=Categorical%20Cross%20Entropy-,Definition,only%20belong%20to%20one%20class.\" target=\"_blank\">**here**</a>) and **Sparse Categorical Accuracy** as our key metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compiling our model\n",
    "acc = tf.keras.metrics.SparseCategoricalAccuracy()\n",
    "opt = tf.keras.optimizers.Adamax()\n",
    "sce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
    "\n",
    "model.compile(\n",
    "    optimizer=opt, loss=[sce, sce], loss_weights=[1, 1], metrics=[[acc], [acc]]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Defining a checkpoint to save our best weights during Training\n",
    "checkpoint = ModelCheckpoint(\n",
    "    filepath=\"basemodel\",\n",
    "    frequency=\"epoch\",\n",
    "    save_weights_only=True,\n",
    "    save_best_only=True,\n",
    "    verbose=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training (using the predefined checkpoint and our validation set)\n",
    "history = model.fit(\n",
    "    [p_train, q_train],\n",
    "    [ys_train, ye_train],\n",
    "    validation_data=([p_val, q_val], [ys_val, ye_val]),\n",
    "    epochs=1,\n",
    "    batch_size=64,\n",
    "    callbacks=[checkpoint],\n",
    "    verbose=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using our Best Model i.e. load the saved weights\n",
    "model.load_weights('basemodel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation in terms of EM and F1 Scores\n",
    "We want to evaluate our model in terms of EM and F1 Score. The function below extracts the predicted spans, computes the EM and F1 scores on each datapoint, and returns an average. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to measure overall EM and F1 on the Test Set\n",
    "\n",
    "def create_masked_matrix(pred_start, pred_end):\n",
    "    # Creating the masked matrix of possible answers (where start < end < start 15)\n",
    "    masked_matrix = tf.matmul(pred_start, tf.transpose(pred_end, [0, 2, 1]))\n",
    "    i, j = np.meshgrid(\n",
    "        *map(np.arange, (masked_matrix.shape[1], masked_matrix.shape[2])), indexing=\"ij\"\n",
    "    )\n",
    "    masked_matrix.mask = (i <= j) & (j < i+15)\n",
    "    masked_matrix = np.where(masked_matrix.mask, masked_matrix, 0)\n",
    "    max_results = np.amax(masked_matrix, axis=(1, 2))\n",
    "    return masked_matrix, max_results\n",
    "\n",
    "\n",
    "def model_eval(pred):\n",
    "    pred_start = tf.exp(pred[0])\n",
    "    pred_end = tf.exp(pred[1])\n",
    "\n",
    "    masked_matrix, max_results = create_masked_matrix(pred_start, pred_end)\n",
    "    number_of_examples = masked_matrix.shape[0]\n",
    "\n",
    "    em = []\n",
    "    f1 = []\n",
    "    \n",
    "    # Find the most probable answer for each question in the test set.\n",
    "    # We compare with the four human answers, and keep the max F1 and EM scores.\n",
    "    for k in range(number_of_examples):\n",
    "        result = np.where(masked_matrix[k] == max_results[k])\n",
    "        if result[1][0] < len(text_to_word_sequence(test.context[k])): \n",
    "            answer = np.array(text_to_word_sequence(test.context[k]))[result[0][0]:result[1][0]+1]\n",
    "        else: answer = ['-']\n",
    "\n",
    "        if result[0][0] != result[1][0] and result[1][0] < len(\n",
    "            text_to_word_sequence(test.context[k])\n",
    "        ):\n",
    "            answer = \" \".join(answer)\n",
    "        else:\n",
    "            answer = str(answer[0])\n",
    "        em_k = max(\n",
    "            exact_match(answer, test.answer1[k]),\n",
    "            exact_match(answer, test.answer2[k]),\n",
    "            exact_match(answer, test.answer3[k]),\n",
    "            exact_match(answer, test.answer4[k]),\n",
    "        )\n",
    "        f1_k = max(\n",
    "            f1_score(answer, test.answer1[k]),\n",
    "            f1_score(answer, test.answer2[k]),\n",
    "            f1_score(answer, test.answer3[k]),\n",
    "            f1_score(answer, test.answer4[k]),\n",
    "        )\n",
    "        em.append(em_k)\n",
    "        f1.append(f1_k)\n",
    "    \n",
    "    print(\"Exact Match: \", np.round(np.mean(em), 3))\n",
    "    print(\"F1 Score: \", np.round(np.mean(f1), 3))\n",
    "           \n",
    "    return (em, f1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's embed and pad the Test set too...\n",
    "pars_test = []\n",
    "ques_test = []\n",
    "embed = hub.KerasLayer(\"https://tfhub.dev/google/nnlm-en-dim128/2\")  # NNLM\n",
    "\n",
    "for text in test.context:\n",
    "    words = np.array(text_to_word_sequence(text))\n",
    "    pars_test.append(embed(tf.constant(words)))\n",
    "for text in test.question:\n",
    "    words = np.array(text_to_word_sequence(text))\n",
    "    ques_test.append(embed(tf.constant(words)))\n",
    "\n",
    "p_test = pad_sequences(\n",
    "    pars_test, padding=\"post\", dtype=\"float32\", maxlen=max_paragraph_length\n",
    ")\n",
    "q_test = pad_sequences(\n",
    "    ques_test, padding=\"post\", dtype=\"float32\", maxlen=max_question_length\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate the model on the Test set\n",
    "pred_test = model.predict([p_test, q_test])\n",
    "print(\"**Results on Test Set:\")\n",
    "(em_model, f1_model) = model_eval(pred_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our first Model - when trained on 70,000+ datapoints - obtains an **Exact Match Score of 11.0%, and a F1 Score of 23.8%** on the Test set. This is not ideal yet... Below are a few ideas on how we could improve from there - by adding complexity while making sure we keep the regularization in check!\n",
    "\n",
    " <img src=\"../assets/qa/ImproveModel.png\" alt=\"image\" style=\"width: 100%;\"/>\n",
    "\n",
    "We can improve the model in different ways:\n",
    "* **Adding features to each *pi***:\n",
    "    * **Binary Feature**: takes the value 1 if *pi* matches any one of the words in ***q***, 0 otherwise. An additional step would be to compare lemmas.\n",
    "    * **POS and NER Features**: to reflect some properties of each *pi* in its context, e.g its part-of-speech (POS) and named entity recognition (NER) tags - for example using <a href=\"https://spacy.io/models\" target=\"_blank\">Spacy</a> and the \"doc.to_array\" method.\n",
    "* **Adding Layers**: The main reason for stacking layers is to allow for greater model complexity - which is exactly what we're looking for here!  \n",
    "* **Bi-directionality**: A key idea of RNNs is to take in a sequence from left (past) to right (future) and preserve information from one word to the next. Conceptually, this implies each word is treated in the context of the words that came before it. But context is not unidirectional: using bi-directionality will allow the model to run from left to right and from right to left: it will preserve information from both past and future, and will be able to treat a word in its *full* context. You can read more about it and how it's implemented <a href=\"https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks\" target=\"_blank\">here</a>.  \n",
    "* **Dropout**: Dropout consists in \"dropping out\" neurons at random to force the model to rely on a wider range of features and avoid over-fitting. Concretely, it will choose to ignore neurons with a pre-defined probability in each forward/backward pass of the training phase. At test time however, Dropout will be turned off and all neurons will be able to play their parts. It is a powerful method for regularizing Neural Networks, and was first introduced <a href=\"https://en.wikipedia.org/wiki/Dilution_(neural_networks)\" target=\"_blank\">by Geoffrey Hinton, et al. in 2012</a>. A solid review of Dropout as applied to RNNs can be found <a href=\"https://medium.com/@bingobee01/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b\" target=\"_blank\">here</a>.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a code snippet to add layers, bidirectionality as well as dropout. Compiling, training and evaluating would be done the exact same way as before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First Input = Paragraphs\n",
    "paragraphs = Input(shape=(max_paragraph_length, emb_dim), name=\"par0\")\n",
    "p = Masking(mask_value=0)(paragraphs)\n",
    "\n",
    "# Bidirectional Multi-Layer with Dropout\n",
    "p = Bidirectional(\n",
    "    GRU(128, return_sequences=True, name=\"par1\", kernel_initializer=\"glorot_normal\"),\n",
    "    merge_mode=\"concat\",\n",
    ")(p)\n",
    "p = Dropout(0.15)(p)\n",
    "\n",
    "p = Bidirectional(\n",
    "    GRU(64, return_sequences=True, name=\"par2\", kernel_initializer=\"glorot_normal\"),\n",
    "    merge_mode=\"concat\",\n",
    ")(p)\n",
    "p = Dropout(0.15)(p)\n",
    "# Output is = a 128d vector per word in the paragraph (None, max_paragraph_length, 128).\n",
    "\n",
    "# Second Input = Questions\n",
    "questions = Input(shape=(max_question_length, emb_dim), name=\"ques0\")\n",
    "q = Masking(mask_value=0)(questions)\n",
    "q = GRU(256, return_sequences=True, name=\"ques2\")(q)\n",
    "q = Dropout(0.15)(q)\n",
    "# Output is = a 256d vector per word in the paragraph (None, max_question_length, 256).\n",
    "\n",
    "# Weighted Average to obtain the single vector q'\n",
    "weights = Dense(1, activation=\"softmax\", name=\"weights\")(q)\n",
    "q = Dot(axes=1, name=\"ques3\")([weights, q])\n",
    "# Output is = a single 256d vector per question (None, 256, 1).\n",
    "\n",
    "# Outputs for Start & End / Quadratic Layers and Softmax\n",
    "qs = Dense(128, activation = 'linear', name = \"s1\", use_bias=False, \n",
    "           kernel_regularizer=regularizers.l2(0.002))(q)\n",
    "outs = Dot(axes=(2, 2), name = \"s2\")([p, qs])\n",
    "\n",
    "qe = Dense(128, activation = 'linear', name = \"e1\", use_bias=False, \n",
    "           kernel_regularizer=regularizers.l2(0.002))(q)\n",
    "oute = Dot(axes=(2, 2), name = \"e2\")([p, qe])\n",
    "# Output is = a probability vector (None, seq_pars, 1) for each\n",
    "\n",
    "# Model\n",
    "model = Model(inputs=[paragraphs, questions], outputs=[outs, oute])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's explore our Results!\n",
    "With this Bidirectional Multi-Layer model, trained on 70,000+ datapoints, we managed to obtain **an Exact Match Score of 22.8%, and a F1 Score of 45.1%** on the Test set - quite the improvement on the first model - although more could be done via hyperparameter tuning, smarter initialization or using a larger training dataset. \n",
    "\n",
    "Let's define a function that will allow us to **explore the performance of our model** on the Test set: the function will take the index of a datapoint, and return the question, context and ground-truth answer (the set of human answers), as well as our model's answer and its EM/F1 scores for this particular example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_masked_matrix_for_one(pred_start, pred_end):\n",
    "    # Creating the masked matrix of possible answers (where start < end < start 15)\n",
    "    masked_matrix = tf.matmul(pred_start, tf.transpose(pred_end))\n",
    "    i, j = np.meshgrid(\n",
    "        *map(np.arange, (masked_matrix.shape)), indexing=\"ij\"\n",
    "    )\n",
    "    masked_matrix.mask = (i <= j) & (j < i+15)\n",
    "    masked_matrix = np.where(masked_matrix.mask, masked_matrix, 0)\n",
    "    max_results = np.where(masked_matrix == np.amax(masked_matrix)) \n",
    "    return masked_matrix, max_results\n",
    "\n",
    "\n",
    "# Function to get the result on the kth question\n",
    "def get_result(k, model=model, verbose=True):\n",
    "\n",
    "    paragraph = tf.expand_dims(p_test[k], 0)\n",
    "    question = tf.expand_dims(q_test[k], 0)\n",
    "    out = model([paragraph, question])\n",
    "    start = tf.exp(out[0][0])\n",
    "    end = tf.exp(out[1][0])\n",
    "    \n",
    "    _, result = create_masked_matrix_for_one(start, end)\n",
    "        \n",
    "    if result[1][0] < len(text_to_word_sequence(test.context[k])): \n",
    "        answer = np.array(text_to_word_sequence(test.context[k]))[result[0][0]:result[1][0]+1]\n",
    "    else: answer = ['-']\n",
    "\n",
    "    if result[0][0] != result[1][0] and result[1][0] < len(text_to_word_sequence(test.context[k])): \n",
    "        answer = \" \".join(answer)\n",
    "    else: answer = str(answer[0])\n",
    "    if verbose:\n",
    "        print(\"--------------------------------------------------------\")\n",
    "        print(\"Question: \", test.question[k])\n",
    "        print(\"--------------------------------------------------------\")\n",
    "        print(\"Context: \")\n",
    "        print(test.context[k])\n",
    "    print(\"--------------------------------------------------------\")\n",
    "    print(\"Model's answer: \", answer)\n",
    "    print(\"Human answers: \")\n",
    "    print(\n",
    "        test.answer1[k],\n",
    "        \" -- \",\n",
    "        test.answer2[k],\n",
    "        \" -- \",\n",
    "        test.answer3[k],\n",
    "        \" -- \",\n",
    "        test.answer4[k],\n",
    "    )\n",
    "    print(\"--------------------------------------------------------\")\n",
    "    print(\n",
    "        \"EM Score: \",\n",
    "        max(\n",
    "            exact_match(answer, test.answer1[k]),\n",
    "            exact_match(answer, test.answer2[k]),\n",
    "            exact_match(answer, test.answer3[k]),\n",
    "            exact_match(answer, test.answer4[k]),\n",
    "        ),\n",
    "    )\n",
    "    print(\n",
    "        \"F1 Score: \",\n",
    "        np.round(\n",
    "            max(\n",
    "                f1_score(answer, test.answer1[k]),\n",
    "                f1_score(answer, test.answer2[k]),\n",
    "                f1_score(answer, test.answer3[k]),\n",
    "                f1_score(answer, test.answer4[k]),\n",
    "            ),\n",
    "            3,\n",
    "        ),\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let's try...\n",
    "get_result(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A few examples:\n",
    "During our various experiments with our most advanced model, we had fun and explored a few examples from our Test set as well as from our Validation set:\n",
    "\n",
    " <img src=\"../assets/qa/Ex1.png\" alt=\"image\" style=\"width: 80%;\"/>  \n",
    "While results were sometimes spot-on, we of course still saw some wrong answers...!  \n",
    "Below are some insteresting results where the model doesn't get a full F1 Score, although one could easily argue it was right:\n",
    " <img src=\"../assets/qa/Ex2.png\" alt=\"image\" style=\"width: 80%;\"/> \n",
    "\n",
    "\n",
    "**Another idea to think about**: what if you not only your model to find the most likely answer in a snippet of text, but you also wanted it to be able to say the answer is NOT in the snippet, and return a \"no answer\" instead? How would you structure the model and data then?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------\n",
    "\n",
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m61",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m61"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
